muHVT: Predicting Cells and Layers using predictLayerHVT to monitor entities over time

Zubin Dowlaty, Somya Shambhawi

2023-05-17

1 Abstract

The muHVT package is a collection of R functions to facilitate building topology preserving maps for rich multivariate data, particularly datasets tending towards big data, i.e. a large number of rows. The functions for this typical workflow are organized below:

  1. Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective

  2. Data Projection: Dimension projection of the compressed cells to 1D, 2D and 3D with Sammon’s non-linear projection algorithm. This step creates topology preserving map coordinates in the desired output dimension. Additionally, embeddings can be learned from the compressed data to further enhance the representation of the multivariate data. Embeddings provide a lower-dimensional space into which high-dimensional vectors, such as word vectors, can be translated, making it easier to perform machine learning tasks on large inputs.

  3. Tessellation: Create the cells required for object visualization using the Voronoi tessellation method; the package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map, and is useful for semi-supervised tasks

  4. Prediction: Scoring new data sets and recording their assignment using the map objects from the above steps, in a sequence of maps if required
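
As a minimal sketch of step 2, a compressed codebook can be projected to 2D with Sammon’s algorithm via `MASS::sammon`. The random centroid matrix below is a hypothetical stand-in for real codebook output:

```r
# Sketch: project hypothetical codebook centroids to 2D with Sammon's mapping.
# The 6-D centroid matrix below is illustrative, not real HVT output.
library(MASS)

set.seed(240)
centroids <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6)

# sammon() minimizes Sammon's stress between the 6-D and 2-D pairwise distances
proj <- MASS::sammon(dist(centroids), k = 2, trace = FALSE)

# proj$points is a 100 x 2 matrix of topology-preserving coordinates
dim(proj$points)
```

The projected coordinates can then be used as the plotting positions for the tessellation step.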

2 Data Understanding

In this vignette, we will use the Prices of Personal Computers dataset. This dataset contains 6261 observations and 6 features. The dataset observes the price from 1993 to 1995 of 486 personal computers in the US. The variables are price, speed, hd, ram, screen and ads.

Here, we load the training and the testing dataset.

set.seed(240)
# Load data from csv files
trainComputers <- read.csv("https://raw.githubusercontent.com/Mu-Sigma/muHVT/dev/vignettes/sample_dataset/trainComputers.csv")
testComputers <- read.csv("https://raw.githubusercontent.com/Mu-Sigma/muHVT/dev/vignettes/sample_dataset/testComputers.csv")

Let’s have a look at the scaled training dataset containing 5008 data points.

# Quick peek
trainComputers <- scale(trainComputers) 
metric_list <- colnames(trainComputers)
scale_attr <- attributes(trainComputers)

trainComputers <- trainComputers %>% as.data.frame()
DT::datatable(trainComputers, rownames = FALSE,
              class = 'cell-border stripe',
              options = list(scrollX = TRUE))

Now let us check the structure of the training data and analyse its summary.

str(trainComputers)
#> 'data.frame':    5008 obs. of  6 variables:
#>  $ price : num  -1.297 -0.8 -1.136 -0.709 1.721 ...
#>  $ speed : num  -1.195 -0.783 -1.195 -1.195 -0.783 ...
#>  $ hd    : num  -1.3136 -1.2898 -0.8854 -0.8854 -0.0767 ...
#>  $ ram   : num  -0.718 -1.108 -0.718 0.063 1.624 ...
#>  $ screen: num  -0.615 -0.615 0.549 -0.615 -0.615 ...
#>  $ ads   : num  -2.39 -2.39 -2.39 -2.39 -2.39 ...
summary(trainComputers)
#>      price             speed               hd                ram          
#>  Min.   :-2.2216   Min.   :-1.1954   Min.   :-1.31361   Min.   :-1.10781  
#>  1st Qu.:-0.7520   1st Qu.:-0.7834   1st Qu.:-0.68562   1st Qu.:-0.71755  
#>  Median :-0.1276   Median : 0.0920   Median :-0.07666   Median : 0.06296  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
#>  3rd Qu.: 0.6269   3rd Qu.: 0.9159   3rd Qu.: 0.44666   3rd Qu.: 0.06296  
#>  Max.   : 5.2569   Max.   : 2.6667   Max.   : 8.29651   Max.   : 4.74606  
#>      screen            ads         
#>  Min.   :-0.615   Min.   :-2.3880  
#>  1st Qu.:-0.615   1st Qu.:-0.4416  
#>  Median :-0.615   Median : 0.2444  
#>  Mean   : 0.000   Mean   : 0.0000  
#>  3rd Qu.: 0.549   3rd Qu.: 0.6273  
#>  Max.   : 2.877   Max.   : 1.5207

3 Map A : Compress using vector quantization

This package can perform vector quantization using the following algorithms -

  • Hierarchical Vector Quantization using k−means
  • Hierarchical Vector Quantization using k−medoids

For more information on vector quantization, refer to the Vector Quantization link in the References section.

The HVT function constructs highly compressed hierarchical Voronoi tessellations. The raw data is first scaled, and the scaled data is supplied as input to the vector quantization algorithm. The algorithm compresses the dataset until a user-defined compression percentage is achieved, using the quantization error as a threshold: for a given compression percentage, the data is compressed into ‘n’ cells such that each cell’s quantization error is below the threshold quantization error.
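
The relationship between the number of cells and the quantization error can be sketched with plain `kmeans`. This is only a conceptual illustration of the compression objective, not the package’s internal implementation, and the 0.5 threshold is arbitrary:

```r
# Sketch: compress rows into cells with kmeans and measure each cell's
# quantization error as the max L1 distance of its points to the centroid.
set.seed(240)
X <- matrix(rnorm(1000 * 4), ncol = 4)

km <- kmeans(X, centers = 25, iter.max = 100)

quant_err <- sapply(seq_len(25), function(k) {
  pts <- X[km$cluster == k, , drop = FALSE]
  max(rowSums(abs(sweep(pts, 2, km$centers[k, ]))))  # max L1 distance in cell k
})

# Compression view: share of cells already under an arbitrary threshold
threshold <- 0.5
mean(quant_err < threshold)
```

Increasing the number of centers shrinks each cell and drives more cells under the threshold, which is the trade-off the compression percentage controls.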

Let’s try to comprehend the HVT function first before moving ahead.

HVT(
  dataset,
  min_compression_perc,
  n_cells,
  depth,
  quant.err,
  distance_metric = c("L1_Norm", "L2_Norm"),
  error_metric = c("mean", "max"),
  quant_method = c("kmeans", "kmedoids"),
  normalize = TRUE,
  diagnose = FALSE,
  hvt_validation = FALSE,
  train_validation_split_ratio = 0.8
)

Each of the parameters have been explained below :

  • dataset - A dataframe with numeric columns

  • min_compression_perc - An integer indicating the minimum percent compression rate to be achieved for the dataset

  • n_cells - An integer indicating the number of cells per hierarchy (level)

  • depth - An integer indicating the number of levels. (1 = No hierarchy, 2 = 2 levels, etc …)

  • quant.err - A number indicating the quantization error threshold. A cell will only break down into further cells if its quantization error is above the defined threshold

  • distance_metric - The distance metric can be L1_Norm or L2_Norm. L1_Norm is selected by default. The distance metric is used to calculate the distance between an n dimensional point and centroid. The user can also pass a custom function to calculate this distance

  • error_metric - The error metric can be mean or max. max is selected by default. max will return the max of m values and mean will take mean of m values where each value is a distance between a point and centroid of the cell. Moreover, the user can also pass a custom function to calculate the error metric

  • quant_method - The quantization method can be kmeans or kmedoids. kmeans is selected by default

  • normalize - A logical value indicating whether the columns in your dataset need to be normalized. Default value is TRUE. The algorithm supports Z-score normalization

  • diagnose - A logical value indicating whether the user wants to perform diagnostics on the model. Default value is FALSE.

  • hvt_validation - A logical value indicating whether the user wants to hold out a validation set and find the mean absolute deviation of the validation points from the centroid. Default value is FALSE.

  • train_validation_split_ratio - A numeric value indicating train validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8

First, we will construct map A by performing Hierarchical Vector Quantization by setting the below mentioned model parameters.

Model Parameters

  • Number of Cells at each Level = 100
  • Maximum Depth = 1
  • Quantization Error Threshold = 0.46
  • Error Metric = Max
  • Distance Metric = Manhattan
map_A <- list()
map_A <-muHVT::HVT(trainComputers,
                n_cells = 100,
                quant.err = 0.46,
                depth = 1,
                distance_metric = "L1_Norm",
                error_metric = "max",
                quant_method = "kmeans",
                normalize = F
)

As per the manual, map_A[[3]] gives us detailed information about the hierarchical vector quantized data. map_A[[3]][['summary']] returns a table containing the number of points, the quantization error, and the codebook for each cell.

The datatable displayed below is the summary from map A

 
DT::datatable(map_A[[3]]$summary, rownames = FALSE,
              class = 'cell-border stripe',
              options = list(scrollX = TRUE))

Now let us understand what each column in the above table means:

  • Segment.Level - Level of the cell. In this case, we have performed Vector Quantization for depth 1. Hence Segment Level is 1

  • Segment.Parent - Parent segment of the cell

  • Segment.Child (Cell.Number) - The children of a particular cell. In this case, it is the total number of cells at which we achieved the defined compression percentage

  • n - Number of points in each cell

  • Cell.ID - Cell IDs are generated for the multivariate data using the 1-D Sammon’s projection algorithm

  • Quant.Error - Quantization Error for each cell

All the columns after this will contain centroids for each cell. They can also be called a codebook, which represents a collection of all centroids or codewords.
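
Scoring against a codebook amounts to nearest-centroid assignment; a hand-rolled sketch using the L1 norm (the codebook here is random, purely for illustration — the package performs this step internally):

```r
# Sketch: assign a new point to the nearest codebook centroid by L1 distance.
set.seed(240)
codebook  <- matrix(rnorm(10 * 6), nrow = 10)  # 10 hypothetical 6-D centroids
new_point <- rnorm(6)

# L1 (Manhattan) distance from the new point to every centroid
l1_dist <- apply(codebook, 1, function(ctr) sum(abs(new_point - ctr)))

assigned_cell <- which.min(l1_dist)  # index of the closest cell
assigned_cell
```

The same logic, applied row by row to a test set, yields the cell assignments recorded in the prediction step later in this vignette.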

Now, let’s check the compression summary for HVT (map A). The table below shows, for each level, the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold.

mapA_compression_summary <- map_A[[3]]$compression_summary %>% dplyr::mutate_if(is.numeric, ~ round(., 4))
DT::datatable(mapA_compression_summary, class = 'cell-border stripe', options = list(scrollX = TRUE))

As can be seen from the table above, 82% of the cells have hit the quantization error threshold.

Now let’s try to understand plotHVT function. The parameters have been explained in detail below

plotHVT(hvt.results, line.width, color.vec, pch1 = 21, centroid.size = 3, title = NULL, maxDepth = 1)
  • hvt.results - A list containing the output of the HVT function which has the details of the tessellations to be plotted

  • line.width - A vector indicating the line widths of the tessellation boundaries for each layer

  • color.vec - A vector indicating the colors of the tessellations boundaries at each layer

  • pch1 - Symbol type of the centroids of the tessellations (parent levels). Refer points (default = 21)

  • centroid.size - Size of centroids of first level tessellations (default = 3)

  • title - Set a title for the plot (default = NULL)

  • maxDepth - An integer indicating the depth up to which the tessellations are to be plotted (default = 1)

Let’s plot the Voronoi tessellation for layer 1 (map A)

muHVT::plotHVT(map_A,
        line.width = c(0.6), 
        color.vec = c("#141B41"),
        centroid.size = 1.5,
        maxDepth = 1) 

Let’s have a look at the function hvtHmap that we will use to overlay features as heatmap.

hvtHmap(hvt.results, dataset, child.level, hmap.cols, color.vec ,line.width, palette.color = 6)
  • hvt.results - A list of results obtained from the HVT function (map_A)

  • dataset - A dataframe containing the variables to overlay as a heatmap. The user can pass an external dataset or the dataset that was used to perform hierarchical vector quantization. The dataset should have the same number of points as the dataset used to perform hierarchical Vector Quantization in the HVT function

  • child.level - A number indicating the level for which the heat map is to be plotted

  • hmap.cols - The column number or column name from the dataset indicating the variable for which the heat map is to be plotted. To plot the quantization error as a heatmap, pass 'quant_error'. Similarly, to plot the number of points in each cell as a heatmap, pass 'no_of_points'

  • color.vec - A color vector such that length(color.vec) = child.level (default = NULL)

  • line.width - A line width vector such that length(line.width) = child.level (default = NULL)

  • palette.color - A number indicating the heat map color palette. 1 - rainbow, 2 - heat.colors, 3 - terrain.colors, 4 - topo.colors, 5 - cm.colors, 6 - BlCyGrYlRd (Blue,Cyan,Green,Yellow,Red) color (default = 6)

  • show.points - A boolean indicating whether the centroids should be plotted on the tessellations (default = FALSE)

Now let’s plot all the features for each cell at level one as a heatmap.

hmap <- list()
col_list <- colnames(trainComputers)
hmap <- lapply(seq_along(col_list), function(x){
  hvtHmap(
    map_A,
    trainComputers,  # dataset used to build map A
    child.level = 1,
    hmap.cols = col_list[[x]],
    line.width = c(0.2),
    color.vec = c("#141B41"),
    palette.color = 6,
    centroid.size = 1.0,
    show.points = T,
    quant.error.hmap = 0.46,
    n_cells.hmap = 100
  ) 
})
grid.arrange(hmap[[1]], nrow = 1, ncol=1)

grid.arrange(hmap[[2]], nrow = 1, ncol=1)

grid.arrange(hmap[[3]], nrow = 1, ncol=1)

grid.arrange(hmap[[4]], nrow = 1, ncol=1)

grid.arrange(hmap[[5]], nrow = 1, ncol=1)

grid.arrange(hmap[[6]], nrow = 1, ncol=1)

4 Map B : Identify the novelties and construct a map on the dataset with novelty

The removeNovelty function removes the identified novelty cell(s) from the dataset and stores those records separately.

It takes input as the cell number (Segment.Child) of the manually identified novelty cell(s) from the above table and the compressed HVT map (map A). It returns a list of two items: dataset with novelty records, and a subset of the dataset without the novelty records.

identified_Novelty_cells <- c(22, 32, 33)
output_list <- removeNovelty(identified_Novelty_cells, map_A)

[1] “The following cell(s) have been removed as outliers from the dataset: 22 32 33”

data_with_novelty <- output_list[[1]]
dataset_without_novelty <- output_list[[2]]

Note - In the dataset with novelty, the total number of cells equals the total number of novelty cell(s) removed from HVT map A. For instance, in the above case, three cells (22nd, 32nd, and 33rd) are identified as novelties and removed; the 22nd cell then becomes the first cell in HVT map B, the 32nd the second, and the 33rd the third.

Let’s have a look at the data with novelties.

colnames(data_with_novelty) <- c("Cell.ID","Segment.Child","price","speed","hd","ram","screen","ads")
DT::datatable(data_with_novelty, rownames = FALSE,
              class = 'cell-border stripe',
              options = list(scrollX = TRUE))

4.1 Voronoi tessellation to highlight novelty cell in the map

The plotCells function is used to plot the Voronoi tessellation using the compressed HVT map (map A) and highlights the identified novelty cell(s) in red on the map.

plotCells(identified_Novelty_cells, map_A)
Figure 3: The Voronoi Tessellation constructed using the compressed HVT map (map A) with the novelty cell(s) highlighted in red

Figure 3: The Voronoi Tessellation constructed using the compressed HVT map (map A) with the novelty cell(s) highlighted in red

We pass the dataframe with novelty records to HVT function along with other model parameters mentioned below to generate map B (layer 2).

Model Parameters

  • Number of Cells at each Level = 3
  • Maximum Depth = 1
  • Quantization Error Threshold = 0.46
  • Error Metric = Max
  • Distance Metric = Manhattan
dataset_with_novelty <- data_with_novelty[,-1:-2]
map_B <- list()
map_B <- muHVT::HVT(dataset_with_novelty,
                  n_cells = 3,
                  depth = 1,
                  quant.err = 0.46,
                  projection.scale = 10,
                  normalize = F,
                  distance_metric = "L1_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  diagnose = F)

The datatable displayed below is the summary from map B (layer 2)

DT::datatable(map_B[[3]]$summary, rownames = FALSE,
              class = 'cell-border stripe',
              options = list(scrollX = TRUE))

5 Map C : Construct a map on the dataset without novelty

Construct another hierarchical Voronoi tessellation map C on the dataset without novelty setting the below mentioned model parameters.

  • Number of Cells at each Level = 100
  • Maximum Depth = 2
  • Quantization Error Threshold = 0.46
  • Error Metric = Max
  • Distance Metric = Manhattan
map_C <- list()
map_C <- muHVT::HVT(dataset_without_novelty,
                  n_cells = 100,
                  depth = 2,
                  quant.err = 0.46,
                  projection.scale = 10,
                  normalize = F,
                  distance_metric = "L1_Norm",
                  error_metric = "max",
                  quant_method = "kmeans",
                  diagnose = F)

The datatable displayed below is the summary from map C (layer 2)

DT::datatable(map_C[[3]]$summary, rownames = FALSE,
              class = 'cell-border stripe',
              options = list(scrollX = TRUE))

Now let’s check the compression summary for HVT (map C). The table below shows, for each level, the number of cells, the number of cells with quantization error below the threshold, and the percentage of cells with quantization error below the threshold.

mapC_compression_summary <- map_C[[3]]$compression_summary %>% dplyr::mutate_if(is.numeric, ~ round(., 4))
DT::datatable(mapC_compression_summary, class = 'cell-border stripe', options = list(scrollX = TRUE))

As can be seen from the table above, 85% of the cells have hit the quantization error threshold in level 1 and 100% of the cells have hit it in level 2.

Now let’s plot all the features for each cell at level two as a heatmap.

hmap <- list()
col_list <- colnames(trainComputers)
hmap <- lapply(seq_along(col_list), function(x){
  hvtHmap(
    map_C,
    dataset_without_novelty,  # dataset used to build map C
    child.level = 2,
    hmap.cols = col_list[[x]],
    line.width = c(0.6,0.4),
    color.vec = c("#141B41","#0582CA"),
    palette.color = 6,
    centroid.size = 1.0,
    show.points = T,
    quant.error.hmap = 0.46,
    n_cells.hmap = 100
  ) #%>% ggplotly() # for plotly
})
grid.arrange(hmap[[1]], nrow = 1, ncol=1)

grid.arrange(hmap[[2]], nrow = 1, ncol=1)

grid.arrange(hmap[[3]], nrow = 1, ncol=1)

grid.arrange(hmap[[4]], nrow = 1, ncol=1)

grid.arrange(hmap[[5]], nrow = 1, ncol=1)

grid.arrange(hmap[[6]], nrow = 1, ncol=1)

6 Prediction on Test Data

Now that we have built the model, let us use the test dataset to predict which cell and which layer each point belongs to.

The predictLayerHVT function is used to score the test data using the predictive set of maps. This function takes as input the test data and the set of maps (map A, map B, and map C).

Let’s have a look at our scaled test dataset.

# Quick peek
testComputers <- scale(testComputers, center = scale_attr$`scaled:center`, scale = scale_attr$`scaled:scale`) %>% as.data.frame()
DT::datatable(head(testComputers), rownames = FALSE,
              class = 'cell-border stripe',
              options = list(scrollX = TRUE))

Now let us check the structure of the test data and analyse its summary

str(testComputers)
#> 'data.frame':    1253 obs. of  6 variables:
#>  $ price : num  -1.228 1.383 -0.802 0.23 0.308 ...
#>  $ speed : num  -0.783 0.092 0.092 2.667 0.916 ...
#>  $ hd    : num  -0.676 3.063 -0.676 -0.41 1.731 ...
#>  $ ram   : num  -0.718 3.185 -0.718 -0.718 1.624 ...
#>  $ screen: num  0.549 0.549 -0.615 -0.615 0.549 ...
#>  $ ads   : num  -0.84 -0.84 -0.84 -0.84 -0.84 ...
summary(testComputers)
#>      price             speed               hd               ram          
#>  Min.   :-2.2216   Min.   :-0.7834   Min.   :-1.0995   Min.   :-1.10781  
#>  1st Qu.:-1.0368   1st Qu.: 0.0920   1st Qu.: 0.3420   1st Qu.: 0.06296  
#>  Median :-0.6167   Median : 0.9159   Median : 0.8986   Median : 0.06296  
#>  Mean   :-0.4276   Mean   : 0.9755   Mean   : 1.4372   Mean   : 0.60148  
#>  3rd Qu.: 0.1228   3rd Qu.: 1.3793   3rd Qu.: 2.3497   3rd Qu.: 1.62400  
#>  Max.   : 2.8957   Max.   : 2.6667   Max.   : 8.2965   Max.   : 4.74606  
#>      screen             ads         
#>  Min.   :-0.6150   Min.   :-3.2654  
#>  1st Qu.:-0.6150   1st Qu.:-1.8296  
#>  Median : 0.5490   Median :-1.4627  
#>  Mean   : 0.4672   Mean   :-1.7811  
#>  3rd Qu.: 0.5490   3rd Qu.:-1.2872  
#>  Max.   : 2.8768   Max.   : 0.7708

Now, let us understand the predictLayerHVT function -

predictLayerHVT(data,
                map_A,
                map_B,
                map_C,
                mad.threshold = 0.2,
                normalize = T, 
                distance_metric="L1_Norm",
                error_metric="max",
                child.level = 1, 
                line.width = c(0.6, 0.4, 0.2),
                color.vec = c("#141B41", "#6369D1", "#D8D2E1"),
                yVar= NULL,
                ...)

Each of the parameters of predictLayerHVT function has been explained below:

  • data - A dataframe containing the test dataset. The dataframe should have at least one variable used for training. The variables from this dataset can also be used to overlay as heatmap

  • map A - A list of hvt.results.model obtained from HVT function while performing hierarchical vector quantization on train data

  • map B - A list of hvt.results.model obtained from HVT function while performing hierarchical vector quantization on train data with novelty

  • map C - A list of hvt.results.model obtained from HVT function while performing hierarchical vector quantization on train data without novelty

  • child.level - A number indicating the layer for which the heat map is to be plotted. (Only used if hmap.cols is not NULL)

  • mad.threshold - A numeric value indicating the permissible Mean Absolute Deviation

  • normalize - A logical value indicating if the columns in your dataset should be normalized. Default value is TRUE.

  • distance_metric - The distance metric can be "L1_Norm" (Manhattan) or "L2_Norm" (Euclidean). L1_Norm is selected by default

  • error_metric - The error metric can be "mean" or "max". max is selected by default

  • yVar - Name of the dependent variable(s)

  • ... - color.vec and line.width can be passed from here

The function predicts based on the HVT maps - map A, map B and map C, constructed using HVT function. For each test record, the function will assign that record to Layer1 or Layer2. Layer1 contains the cell ids from map A and Layer 2 contains cell ids from map B (novelty map) and map C (map without novelty).

Note: The prediction algorithm will not work if any of the variables used to perform quantization are missing, so no features should be removed from the test dataset.
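
Conceptually, the routing of a single record can be sketched as below. This is an approximation based on the description above, not the actual predictLayerHVT internals, and the centroids and `mad_threshold` value (mirroring the mad.threshold argument) are made up for illustration:

```r
# Sketch: route a record to Layer 2's novelty map (B) or regular map (C)
# depending on whether its mean absolute deviation from the nearest
# map A centroid exceeds the threshold. Centroids are random stand-ins.
set.seed(240)
centroids_A   <- matrix(rnorm(100 * 6), nrow = 100)  # hypothetical map A codebook
record        <- rnorm(6)                            # one test record
mad_threshold <- 0.2

# Mean absolute deviation of the record from each map A centroid
d <- apply(centroids_A, 1, function(ctr) mean(abs(record - ctr)))
nearest <- which.min(d)

# Large deviation -> treat as novelty (map B); otherwise score against map C
layer2_map <- if (d[nearest] > mad_threshold) "map B (novelty)" else "map C"
c(Layer1.Cell.ID = paste0("A", nearest), routed_to = layer2_map)
```

The actual function applies this kind of decision per record and then assigns a cell ID within the chosen Layer 2 map.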

Let’s see which cell and layer each point belongs to.

validation_data <- testComputers
new_predict <- predictLayerHVT(
    data=validation_data,
    map_A,
    map_B,
    map_C,
    normalize = F
  )
DT::datatable(new_predict, rownames = FALSE,
              class = 'cell-border stripe',
              options = list(scrollX = TRUE))

Note: From the above table, we can see that the 954th observation from the test data, having Cell.ID A99, has been identified as a novelty and is mapped to the 1st novelty cell (B1) in the Layer2.Cell.ID column. Similarly, the 390th, 413th, 421st, 432nd, 530th, … test records, having Cell.ID A94, are correctly identified as novelties and are mapped to the 3rd novelty cell (B3) in the Layer2.Cell.ID column.

7 Executive Summary

  • We have considered computers dataset for creating a predictive set of maps to monitor entities over time using predictLayerHVT() in this notebook

  • We construct a compressed HVT map (Map A) using the HVT() on the training dataset by setting n_cells to 100 and quant.error to 0.46

  • Based on the output of the above step, we manually identify the novelty cell(s) from the plotted map A. For this dataset, we identify the 22nd, 32nd and 33rd cells as the novelty cells.

  • We pass the identified novelty cell(s) as a parameter to removeNovelty() along with HVT map A. The function removes those novelty cells from the dataset and stores them separately. It also returns the dataset without the novelties.

  • The plotCells() function constructs hierarchical Voronoi tessellations and highlights the identified novelty cell(s) in red

  • The dataset with novelty is then passed to the HVT() to construct another HVT map (map B). But here, we set the parameters n_cells = 3, depth = 1 etc. when constructing the map

  • The dataset without novelties is then passed to the HVT() to construct another HVT map (map C). But here, we set the parameters n_cells = 100, depth = 2 etc. when constructing the map.

  • Finally, the set of maps - map A, map B, and map C are passed to the predictLayerHVT() along with the test dataset to predict which map and what cell each test record is assigned to.

  • The output of predictLayerHVT is a dataset with two columns, Layer1.Cell.ID and Layer2.Cell.ID. Layer1.Cell.ID contains cell IDs from map A in the form A1, A2, A3, …, and Layer2.Cell.ID contains cell IDs from map B (B1, B2, …, depending on the identified novelties) and from map C (C1, C2, C3, …).

  • From the output of predictLayerHVT we can see that the 954th observation from the test data, having Cell.ID A99, has been identified as a novelty and is mapped to the 1st novelty cell (B1) in the Layer2.Cell.ID column. Similarly, the 390th, 413th, 421st, 432nd, 530th, … test records, having Cell.ID A94, are correctly identified as novelties and are mapped to the 3rd novelty cell (B3) in the Layer2.Cell.ID column.

8 References

  1. Topology Preserving Maps : https://link.springer.com/chapter/10.1007/1-84628-118-0_7

  2. Vector Quantization : https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-450-principles-of-digital-communications-i-fall-2006/lecture-notes/book_3.pdf

  3. K-means : https://en.wikipedia.org/wiki/K-means_clustering

  4. Sammon’s Projection : http://en.wikipedia.org/wiki/Sammon_mapping

  5. Voronoi Tessellations : http://en.wikipedia.org/wiki/Centroidal_Voronoi_tessellation